Abstract — This paper derives new entropy bounds for discrete
random variables via maximal coupling. It provides bounds on
the difference between the entropies of two discrete random
variables in terms of the local and total variation distances
between their probability mass functions. These bounds address
cases of finite or countably infinite alphabets. Particular cases
of these bounds reproduce some known results. The use of the
new entropy bounds is exemplified by relying on some bounds on
the above distances via Stein's method. The improvement that is
obtained by these bounds is exemplified.

Index Terms — Coupling, entropy, local distance, Stein's 
method, total variation distance. 



I. Introduction 

Inequalities that relate the Shannon entropy or information
divergence to the total variation distance have been studied
extensively over the last fifty years (see, e.g., [...]).

Among the observations in these works, it is known that a
sufficiently small total variation distance between a pair of
discrete random variables with a finite and fixed alphabet
implies a small difference between their entropies. However,
if the size of the alphabet is finite but not bounded, then
for an arbitrarily small $\delta > 0$ and an arbitrarily large $\mu > 0$,
there exists a pair of discrete random variables such that the
total variation distance between them is less than $\delta$ whereas
the difference between their entropies is larger than $\mu$ (see,
e.g., [21, Theorem 1] with a concrete example in its proof).

The interplay between the entropy difference of two discrete
random variables and their total variation distance was studied
in [7, Theorem 17.3.3] or [10, Lemma 2.7], [..., Lemma 1],
[...], [..., Section 2] and [49]. The bounds that are
derived in this work improve some existing bounds as a result
of their dependence on both the local and total variation
distances and the alphabet sizes (the relevant distances are
defined later in this section). The new bounds are derived via
the use of maximal coupling, which is also known to be useful
for the derivation of error bounds via Stein's method (see, e.g.,
[40, Chapter 2] and [...]). It is noted that the entropy bounds
in [49] are also derived via coupling, but the approach of the
analysis in this work is markedly different (see Sections II
and III). The new bounds are linked to Stein's method, and the
improvement that is achieved by these bounds is exemplified.

We provide in the following the essential mathematical 
background that is required for the analysis in this work. 

Definition 1: A coupling of a pair of discrete random
variables $(X, Y)$ is a pair of random variables $(\hat{X}, \hat{Y})$ such
that the marginal distributions of $(X, Y)$ and $(\hat{X}, \hat{Y})$ coincide,
i.e., $P_X = P_{\hat{X}}$ and $P_Y = P_{\hat{Y}}$.

Definition 2: For a pair of random variables $(X, Y)$, a
coupling $(\hat{X}, \hat{Y})$ is called a maximal coupling if $\mathbb{P}(\hat{X} = \hat{Y})$
is as large as possible among all the couplings of $(X, Y)$.

The following theorem is a basic result on maximal coupling
that also suggests, as part of its proof, a construction for
maximal coupling. We later rely on this particular construction
to derive in Section III some new bounds on the entropy of
discrete random variables. Hence, the proof of the following
known theorem serves for the analysis in this work.

Theorem 1: Let $X$ and $Y$ be discrete random variables that
take values in a set $\mathcal{A}$, and let their respective probability mass
functions be

$$P_X(x) = \mathbb{P}(X = x), \quad P_Y(y) = \mathbb{P}(Y = y), \quad \forall\, x, y \in \mathcal{A}.$$

Then, the maximal coupling $(\hat{X}, \hat{Y})$ of $(X, Y)$ satisfies

$$\mathbb{P}(\hat{X} = \hat{Y}) = \sum_{u \in \mathcal{A}} \min\{P_X(u), P_Y(u)\}. \tag{1}$$



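Proof: Let $B \triangleq \{u \in \mathcal{A} : P_X(u) < P_Y(u)\}$, and let
$B^c \triangleq \mathcal{A} \setminus B$. Then, for every coupling $(\hat{X}, \hat{Y})$ of $(X, Y)$,

$$\begin{aligned}
\mathbb{P}(\hat{X} = \hat{Y})
&= \mathbb{P}(\hat{X} = \hat{Y} \in B) + \mathbb{P}(\hat{X} = \hat{Y} \in B^c) \\
&\le \mathbb{P}(\hat{X} \in B) + \mathbb{P}(\hat{Y} \in B^c) \\
&= \mathbb{P}(X \in B) + \mathbb{P}(Y \in B^c) \\
&= \sum_{u \in B} P_X(u) + \sum_{u \in B^c} P_Y(u) \\
&= \sum_{u \in B} \min\{P_X(u), P_Y(u)\} + \sum_{u \in B^c} \min\{P_X(u), P_Y(u)\} \\
&= \sum_{u \in \mathcal{A}} \min\{P_X(u), P_Y(u)\}.
\end{aligned} \tag{2}$$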

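The following provides a construction of a coupling $(\hat{X}, \hat{Y})$
that achieves the bound in (2) with equality, so it forms
a maximal coupling of $(X, Y)$. Let $U$, $V$, $W$ and $J$ be
independent discrete random variables, where

$$\mathbb{P}(J = 0) = 1 - p, \quad \mathbb{P}(J = 1) = p \tag{3}$$

so $J \sim \text{Bernoulli}(p)$, where $p \triangleq \sum_{u \in \mathcal{A}} \min\{P_X(u), P_Y(u)\}$ is the
right-hand side of (1), and let $U$, $V$, $W$ have the following
probability mass functions:

$$P_U(u) = \frac{\min\{P_X(u), P_Y(u)\}}{p}, \quad \forall\, u \in \mathcal{A} \tag{4}$$

$$P_V(v) = \frac{P_X(v) - \min\{P_X(v), P_Y(v)\}}{1 - p}, \quad \forall\, v \in \mathcal{A} \tag{5}$$

$$P_W(w) = \frac{P_Y(w) - \min\{P_X(w), P_Y(w)\}}{1 - p}, \quad \forall\, w \in \mathcal{A}. \tag{6}$$

If $J = 1$, let $\hat{X} = \hat{Y} = U$, and if $J = 0$ let $\hat{X} = V$ and
$\hat{Y} = W$. For every $x, y \in \mathcal{A}$

$$\begin{aligned}
P_{\hat{X}}(x) &= \mathbb{P}(\hat{X} = x) \\
&= p\, \mathbb{P}(\hat{X} = x \mid J = 1) + (1 - p)\, \mathbb{P}(\hat{X} = x \mid J = 0) \\
&= p\, P_U(x) + (1 - p)\, P_V(x) \\
&= P_X(x)
\end{aligned}$$

and similarly $P_{\hat{Y}}(y) = P_Y(y)$, so $(\hat{X}, \hat{Y})$ is indeed a coupling
of $(X, Y)$. Furthermore,

$$\mathbb{P}(\hat{X} = \hat{Y}) \ge \mathbb{P}(J = 1) = p \tag{7}$$

so, from (2) and (7), it follows that the proposed construction
for $(\hat{X}, \hat{Y})$ is a maximal coupling of $(X, Y)$. ■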
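
The construction in the proof of Theorem 1 can be checked numerically. The following is a minimal sketch rather than code from the paper: the function name `maximal_coupling`, the dictionary input format and the example distributions are assumptions made for illustration. Exact rational arithmetic is used so that the marginals and the value of $\mathbb{P}(\hat{X} = \hat{Y})$ can be compared without rounding.

```python
from fractions import Fraction


def maximal_coupling(px, py):
    """Joint pmf {(x, y): prob} of the coupling built in the proof of Theorem 1.

    px, py: dicts mapping symbols to probabilities (hypothetical input format).
    """
    symbols = sorted(set(px) | set(py))
    low = {u: min(px.get(u, 0), py.get(u, 0)) for u in symbols}
    p = sum(low.values())  # p = sum_u min{P_X(u), P_Y(u)}, see (1)
    joint = {}
    # Branch J = 1 (probability p): X_hat = Y_hat = U with P_U = low / p,
    # so the pair (u, u) receives mass p * P_U(u) = low[u].
    for u in symbols:
        if low[u] > 0:
            joint[(u, u)] = low[u]
    # Branch J = 0 (probability 1 - p): X_hat = V and Y_hat = W independently,
    # so (v, w) receives (1 - p) * P_V(v) * P_W(w) = ex * ey / (1 - p).
    if p < 1:
        for v in symbols:
            ex = px.get(v, 0) - low[v]
            for w in symbols:
                ey = py.get(w, 0) - low[w]
                if ex > 0 and ey > 0:
                    joint[(v, w)] = joint.get((v, w), 0) + ex * ey / (1 - p)
    return joint


# Illustrative distributions (not taken from the paper).
px = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
py = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}

joint = maximal_coupling(px, py)
marg_x = {u: sum(q for (x, _), q in joint.items() if x == u) for u in px}
marg_y = {u: sum(q for (_, y), q in joint.items() if y == u) for u in py}
p_equal = sum(q for (x, y), q in joint.items() if x == y)
```

In this example the marginals of the joint pmf reproduce `px` and `py`, and `p_equal` coincides with the sum of the pointwise minima, as (1) requires.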
Definition 3: Let $X$ and $Y$ be discrete random variables
that take values in a set $\mathcal{A}$, and let $P_X$ and $P_Y$ be their
respective probability mass functions. The local distance and
total variation distance between $X$ and $Y$ are, respectively,

$$d_{\mathrm{loc}}(X, Y) \triangleq \sup_{u \in \mathcal{A}} |P_X(u) - P_Y(u)| \tag{8}$$

$$d_{\mathrm{TV}}(X, Y) \triangleq \frac{1}{2} \sum_{u \in \mathcal{A}} |P_X(u) - P_Y(u)|. \tag{9}$$

With a slight abuse of notation, one can also write $d_{\mathrm{loc}}(P_X, P_Y)$
and $d_{\mathrm{TV}}(P_X, P_Y)$, respectively.

Remark 1: The factor of one-half on the right-hand side
of (9) normalizes the total variation distance to take values
between zero and one. It is noted that the notation in the
literature is not consistent: the factor of one-half on the
right-hand side of (9) is sometimes omitted. It is easy to show (see,
e.g., [..., Lemma 5.4 on pp. 133-134]) that

$$d_{\mathrm{TV}}(X, Y) = \sup_{B \subseteq \mathcal{A}} |\mathbb{P}(X \in B) - \mathbb{P}(Y \in B)|.$$

From the last equality (restricted to singletons $B$) and the
definition of the local distance in (8), it follows that
$d_{\mathrm{loc}}(X, Y) \le d_{\mathrm{TV}}(X, Y)$.

The following result is a simple consequence of Theorem 1,
and it is also used for the derivation of the new bounds on the
entropy in Section III.

Theorem 2: Let $X$ and $Y$ be two discrete random variables
that take values in a set $\mathcal{A}$. If $(\hat{X}, \hat{Y})$ is a maximal coupling
of $(X, Y)$ then

$$\mathbb{P}(\hat{X} \ne \hat{Y}) = d_{\mathrm{TV}}(X, Y). \tag{10}$$

Proof: This follows from (1) and (9), and the equality
$\min\{a, b\} = \frac{a + b - |a - b|}{2}$ for $a, b \in \mathbb{R}$. ■
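
As a quick numerical illustration of Definition 3 and Theorem 2, the sketch below computes both distances for a pair of distributions on a common finite alphabet and compares $1 - \sum_u \min\{P_X(u), P_Y(u)\}$ with $d_{\mathrm{TV}}(X, Y)$. The function names and the example distributions are illustrative assumptions.

```python
import math


def local_distance(px, py):
    """d_loc in (8) for pmfs given as equal-length lists."""
    return max(abs(a - b) for a, b in zip(px, py))


def total_variation(px, py):
    """d_TV in (9), including the factor of one-half."""
    return 0.5 * sum(abs(a - b) for a, b in zip(px, py))


# Illustrative distributions on a four-letter alphabet.
px = [0.4, 0.3, 0.2, 0.1]
py = [0.25, 0.25, 0.25, 0.25]

d_loc = local_distance(px, py)
d_tv = total_variation(px, py)
# P(X_hat = Y_hat) for a maximal coupling, by (1).
overlap = sum(min(a, b) for a, b in zip(px, py))
```

Here `1 - overlap` agrees with `d_tv` up to rounding, which is the content of (10), and `d_loc` does not exceed `d_tv`, in line with Remark 1.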
The continuation of this paper is structured as follows:
Section II provides a simple proof, via maximal coupling, for
an existing bound on the difference between the entropies of
two discrete random variables in terms of their total variation
distance (see [21, Theorem 6] and [49, Eq. (4)]). The proof
of this bound is a shortened version of the proof in [49],
and it serves to motivate the derivation of some refined
bounds on the difference between the entropies of two discrete
random variables. These new bounds, proved in Section III via
maximal coupling, depend on both the local and total variation
distances. Section IV exemplifies the use of the new bounds
with a link to Stein's method, and it also compares them with
some existing bounds.



II. A Proof of a Known Bound on the Entropy of 
Discrete Random Variables via Coupling 

The following theorem relies on a bound that first appeared
in [49, Eq. (4)], where it was proved by coupling. It was later
introduced in [21, Theorem 6] by re-proving the inequality in a
different way (without coupling), and it was also strengthened
there by showing an explicit case where the following bound is tight.
As is proved in [49, Section 3], the bound on the entropy
difference that is introduced in the following theorem improves
the bound in [7, Theorem 17.3.3] or [10, Lemma 2.7].

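Theorem 3: Let $X$ and $Y$ be two discrete random variables
that take values in a set $\mathcal{A}$, and let $|\mathcal{A}| = M$. If $d_{\mathrm{TV}}(X, Y) \le \varepsilon$,
then

$$|H(X) - H(Y)| \le \begin{cases}
\varepsilon \log(M - 1) + h(\varepsilon) & \text{if } \varepsilon \in \left[0, 1 - \frac{1}{M}\right] \\
\log(M) & \text{if } \varepsilon > 1 - \frac{1}{M}
\end{cases}$$

where $h$ denotes the binary entropy function. Furthermore,
there is a case where the bound is achieved with equality.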

The following proof of Theorem 3 exemplifies the use of
maximal coupling in proving an information-theoretic result.

Proof: Let $(\hat{X}, \hat{Y})$ be a maximal coupling of $(X, Y)$.
Since $H(\hat{X}) = H(X)$ and $H(\hat{Y}) = H(Y)$ (note that the
marginal probability mass functions of $(X, Y)$ and $(\hat{X}, \hat{Y})$ are
the same), it follows from Fano's inequality and Theorem 2
(see (10)) that

$$\begin{aligned}
|H(X) - H(Y)|
&= |H(\hat{X}) - H(\hat{Y})| \\
&= |H(\hat{X} | \hat{Y}) - H(\hat{Y} | \hat{X})| \\
&\le \max\{H(\hat{X} | \hat{Y}), H(\hat{Y} | \hat{X})\} \\
&\le \mathbb{P}(\hat{X} \ne \hat{Y}) \log(M - 1) + h\bigl(\mathbb{P}(\hat{X} \ne \hat{Y})\bigr) \\
&= d_{\mathrm{TV}}(X, Y) \log(M - 1) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr).
\end{aligned}$$

This proves the bound in [49, Eq. (4)]. If $d_{\mathrm{TV}}(X, Y) \le \varepsilon$
for some $\varepsilon \in [0, 1 - \frac{1}{M}]$, the replacement of $d_{\mathrm{TV}}(X, Y)$ in
the last bound by $\varepsilon$ is valid; this holds since the function
$f(x) = x \log(M - 1) + h(x)$ is monotonically increasing over the
interval $[0, 1 - \frac{1}{M}]$ (since $f'(x) = \log(M - 1) + \log\frac{1 - x}{x} > 0$
for $0 < x < 1 - \frac{1}{M}$). Otherwise, if $\varepsilon > 1 - \frac{1}{M}$, then

$$|H(X) - H(Y)| \le \max\{H(X), H(Y)\} \le \log(M).$$

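Cases where the bound is tight [21]: If $\varepsilon \in [0, 1 - \frac{1}{M}]$, the
bound is tight when

$$X \sim P_X = \Bigl(1 - \varepsilon, \frac{\varepsilon}{M - 1}, \ldots, \frac{\varepsilon}{M - 1}\Bigr), \quad Y \sim P_Y = (1, 0, \ldots, 0)$$

which implies that

$$d_{\mathrm{TV}}(X, Y) = \varepsilon, \quad |H(X) - H(Y)| = H(X) = h(\varepsilon) + \varepsilon \log(M - 1).$$

If $\varepsilon \in (1 - \frac{1}{M}, 1]$ then the bound is tight when

$$X \sim \Bigl(\frac{1}{M}, \ldots, \frac{1}{M}\Bigr), \quad Y \sim (1, 0, \ldots, 0)$$

so $d_{\mathrm{TV}}(X, Y) = 1 - \frac{1}{M} < \varepsilon$ and $|H(X) - H(Y)| = \log(M)$.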
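
The bound of Theorem 3 and its first tightness case can be evaluated directly. The sketch below is illustrative only: the helper names (`binary_entropy`, `entropy`, `theorem3_bound`) and the parameters $M = 5$, $\varepsilon = 0.3$ are assumptions, and natural logarithms are used, so entropies are in nats.

```python
import math


def binary_entropy(x):
    """Binary entropy h(x) in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def entropy(p):
    """Shannon entropy in nats of a pmf given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0)


def theorem3_bound(eps, M):
    """Upper bound of Theorem 3 on |H(X) - H(Y)| when d_TV(X, Y) <= eps."""
    if M < 2:
        raise ValueError("the alphabet size M must be at least 2")
    if eps <= 1.0 - 1.0 / M:
        return eps * math.log(M - 1) + binary_entropy(eps)
    return math.log(M)


# Extremal pair for eps in [0, 1 - 1/M]: X puts mass 1 - eps on one symbol
# and spreads eps uniformly over the other M - 1 symbols; Y is deterministic.
M, eps = 5, 0.3
px = [1.0 - eps] + [eps / (M - 1)] * (M - 1)
py = [1.0] + [0.0] * (M - 1)

d_tv = 0.5 * sum(abs(a - b) for a, b in zip(px, py))
gap = abs(entropy(px) - entropy(py))
```

For this pair `d_tv` equals `eps` and `gap` matches `theorem3_bound(eps, M)` up to rounding, so the bound is met with equality.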
III. New Bounds on the Entropy of Discrete
Random Variables via Coupling

In the cases where the known bound in Theorem 3 was
shown to be tight in [21] (see the last part of the proof in
Section II), it is easy to verify that the local distance is equal
to the total variation distance. However, as is shown in the
following, if this is not the case (i.e., the local distance is smaller
than the total variation distance), then the bound in Theorem 3
is necessarily not tight. Furthermore, this section provides
new bounds that depend on both the total variation and local
distances. If these two distances are equal then the new bound
is particularized to the bound in Theorem 3 but otherwise, the
new bound improves the bound in Theorem 3. The general
approach for proving the following new inequalities relies on
the construction of the maximal coupling that is introduced in
the proof of Theorem 1. The new results are stated and proved
in the following.

Theorem 4: Let $X$ and $Y$ be two discrete random variables
that take values in a set $\mathcal{A}$, and let $|\mathcal{A}| = M$. Then,

$$|H(X) - H(Y)| \le d_{\mathrm{TV}}(X, Y) \log(M\alpha - 1) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr) \tag{11}$$

where

$$\alpha \triangleq \frac{d_{\mathrm{loc}}(X, Y)}{d_{\mathrm{TV}}(X, Y)} \tag{12}$$

denotes the ratio of the local and total variation distances
(so, $\alpha \in [\frac{2}{M}, 1]$), and $h$ denotes the binary entropy function.
Furthermore, if the probability mass functions of $X$ and $Y$
satisfy the condition that $\frac{1}{2} \le \frac{P_X}{P_Y} \le 2$ whenever $P_X, P_Y > 0$,
then the bound in (11) is tightened to

$$|H(X) - H(Y)| \le d_{\mathrm{TV}}(X, Y) \log\Bigl(\frac{M\alpha - 1}{4}\Bigr) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr). \tag{13}$$



Remark 2: Since, in general, $\alpha \le 1$, the case where
$\alpha = 1$ is the worst case for the bound in (11). In the latter
case, it is particularized to the bound in Theorem 3 (see [21,
Theorem 6] or [49, Eq. (4)]).

Remark 3: If $\alpha \le \frac{1}{N}$ for some integer $N$ (since $\alpha \in [\frac{2}{M}, 1]$
then $N \in \{1, \ldots, \lfloor \frac{M}{2} \rfloor\}$), the bound in (11) implies that

$$|H(X) - H(Y)| \le d_{\mathrm{TV}}(X, Y) \log\Bigl(\frac{M - N}{N}\Bigr) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr). \tag{14}$$

The bounds in (14) and [21, Theorem 7] are similar but they
hold under different conditions. The bound in [21, Theorem 7]
requires that $P_X, P_Y \le \frac{1}{N}$ everywhere, whereas the bound in
(14) holds under the requirement that the ratio $\alpha$ of the local
and total variation distances satisfies $\alpha \le \frac{1}{N}$. Neither of these
conditions implies the other.
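
To make the statement of Theorem 4 concrete, the following sketch evaluates (11) and, when applicable, (13) for two probability mass functions on the same finite alphabet. It is an illustration under stated assumptions: the function names and the example are hypothetical, logarithms are natural, and the ratio condition of (13) is checked in a conservative form that requires both distributions to be strictly positive on every symbol.

```python
import math


def h(x):
    """Binary entropy in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def theorem4_bound(px, py, refined=False):
    """Right-hand side of (11), or of (13) when refined=True."""
    M = len(px)
    d_tv = 0.5 * sum(abs(a - b) for a, b in zip(px, py))
    if d_tv == 0.0:
        return 0.0  # identical pmfs, so the entropy difference is zero
    d_loc = max(abs(a - b) for a, b in zip(px, py))
    alpha = d_loc / d_tv  # the ratio in (12)
    # M * alpha - 1 >= 1 in exact arithmetic; the clamp only guards rounding.
    bound = d_tv * math.log(max(M * alpha - 1.0, 1.0)) + h(d_tv)
    if refined:
        # Conservative reading of the condition 1/2 <= P_X / P_Y <= 2.
        ok = all(a > 0 and b > 0 and 0.5 <= a / b <= 2.0
                 for a, b in zip(px, py))
        if not ok:
            raise ValueError("ratio condition of (13) is not satisfied")
        bound -= 2.0 * math.log(2.0) * d_tv  # d_TV * log(1/4)
    return bound


# Illustrative example: here alpha = 1/2, so M * alpha - 1 = 1.
px = [0.3, 0.2, 0.3, 0.2]
py = [0.25, 0.25, 0.25, 0.25]

gap = abs(entropy(px) - entropy(py))
plain = theorem4_bound(px, py)
tight = theorem4_bound(px, py, refined=True)
thm3 = 0.1 * math.log(len(px) - 1) + h(0.1)  # Theorem 3 with eps = d_TV = 0.1
```

In this example the entropy gap is about 0.020 nats, while (13), (11) and the bound of Theorem 3 give roughly 0.187, 0.325 and 0.435 nats, respectively.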

We prove Theorem 4 in the following.

Proof: Assume without loss of generality (w.l.o.g.) that
$H(X) - H(Y) \ge 0$ (note that there is a symmetry between
$X$ and $Y$ in $|H(X) - H(Y)|$, $d_{\mathrm{loc}}(X, Y)$ and $d_{\mathrm{TV}}(X, Y)$).

Let $(\hat{X}, \hat{Y})$ be the maximal coupling of $(X, Y)$ according
to the construction in the proof of Theorem 1. Then,

$$\begin{aligned}
|H(X) - H(Y)|
&= H(X) - H(Y) \\
&= H(\hat{X}) - H(\hat{Y}) \\
&= H(\hat{X} | J) - H(\hat{Y} | J) + I(\hat{X}; J) - I(\hat{Y}; J).
\end{aligned} \tag{15}$$
The conditional entropy $H(\hat{X} | J)$ satisfies

$$\begin{aligned}
H(\hat{X} | J)
&= \mathbb{P}(J = 0)\, H(\hat{X} | J = 0) + \mathbb{P}(J = 1)\, H(\hat{X} | J = 1) \\
&\overset{(a)}{=} d_{\mathrm{TV}}(X, Y)\, H(V | J = 0) + \bigl(1 - d_{\mathrm{TV}}(X, Y)\bigr) H(U | J = 1) \\
&\overset{(b)}{=} d_{\mathrm{TV}}(X, Y)\, H(V) + \bigl(1 - d_{\mathrm{TV}}(X, Y)\bigr) H(U)
\end{aligned} \tag{16}$$

where equality (a) holds since $J \sim \text{Bernoulli}(p)$ with

$$p = \mathbb{P}(J = 1) = \mathbb{P}(\hat{X} = \hat{Y}) = 1 - d_{\mathrm{TV}}(X, Y)$$

(see the proof of Theorem 1 and the result in Theorem 2),
and because $\hat{X}$ is equal to $V$ or $U$ when $J$ takes the value
zero or one, respectively. Furthermore, equality (b) holds
since $U, V, W, J$ are independent random variables (due to the
construction shown in the proof of Theorem 1). Similarly,

$$H(\hat{Y} | J) = d_{\mathrm{TV}}(X, Y)\, H(W) + \bigl(1 - d_{\mathrm{TV}}(X, Y)\bigr) H(U). \tag{17}$$

Combining (15)-(17) yields that

$$|H(X) - H(Y)| = d_{\mathrm{TV}}(X, Y) \bigl(H(V) - H(W)\bigr) + I(\hat{X}; J) - I(\hat{Y}; J). \tag{18}$$



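From (5) and (6), it follows that

$$P_V(a)\, P_W(a) = 0, \quad \forall\, a \in \mathcal{A} \tag{19}$$

and also, for every $a \in \mathcal{A}$,

$$\begin{aligned}
P_V(a) + P_W(a)
&= \frac{P_X(a) + P_Y(a) - 2 \min\{P_X(a), P_Y(a)\}}{d_{\mathrm{TV}}(X, Y)} \\
&= \frac{|P_X(a) - P_Y(a)|}{d_{\mathrm{TV}}(X, Y)} \\
&\le \frac{d_{\mathrm{loc}}(X, Y)}{d_{\mathrm{TV}}(X, Y)} = \alpha.
\end{aligned} \tag{20}$$

In the following, we derive upper bounds on $H(V) - H(W)$
and $I(\hat{X}; J) - I(\hat{Y}; J)$, and rely on (18) to get an upper bound
on $|H(X) - H(Y)|$. Let $\mathcal{A} = \{a_1, \ldots, a_M\}$, and

$$s_i \triangleq P_V(a_i), \quad t_i \triangleq P_W(a_i), \quad \forall\, i \in \{1, \ldots, M\}.$$

From (19) and (20),

$$s_i t_i = 0, \quad s_i + t_i \le \alpha, \quad \forall\, i \in \{1, \ldots, M\}$$

and $H(V) - H(W) = -\sum_{i=1}^{M} s_i \log(s_i) + \sum_{i=1}^{M} t_i \log(t_i)$.
Hence, for fixed $\alpha$ and $M$ (since $|\mathcal{A}| = M$, then $\alpha \in [\frac{2}{M}, 1]$)

$$H(V) - H(W) \le g(\alpha) \tag{21}$$

where $g(\alpha)$ is the solution of the optimization problem

$$\begin{aligned}
\text{maximize} \quad & -\sum_{i=1}^{M} s_i \log(s_i) + \sum_{i=1}^{M} t_i \log(t_i) \\
\text{subject to} \quad & s_i, t_i \ge 0, \quad s_i + t_i \le \alpha, \quad s_i t_i = 0, \quad \forall\, i \in \{1, \ldots, M\} \\
& \sum_{i=1}^{M} s_i = \sum_{i=1}^{M} t_i = 1
\end{aligned} \tag{22}$$

with the $2M$ variables $s_1, t_1, \ldots, s_M, t_M$. Fortunately, this non-convex
optimization problem admits a closed-form solution.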

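Lemma 1: The solution of the non-convex optimization
problem in (22), denoted by $g(\alpha)$, is the following:

$$g(\alpha) = \log\Bigl(M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil\Bigr)
+ \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor \log \alpha
+ \Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \log\Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \tag{23}$$

with the convention that $0 \log 0$ means 0.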

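Proof: Let us first show that the right-hand side of (23)
forms an upper bound on $g(\alpha)$, and then show
that this upper bound is tight.

For the derivation of the upper bound, note that due to the
above constraints,

$$\begin{aligned}
& 1 = \sum_{i=1}^{M} t_i \overset{(a)}{\le} \alpha\, \bigl|\{i \in \{1, \ldots, M\} : t_i > 0\}\bigr| \\
\Rightarrow\ & \bigl|\{i \in \{1, \ldots, M\} : t_i > 0\}\bigr| \ge \frac{1}{\alpha} \\
\overset{(b)}{\Rightarrow}\ & \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \le M - \frac{1}{\alpha} \\
\overset{(c)}{\Rightarrow}\ & \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \le M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil
\end{aligned} \tag{24}$$

where inequality (a) holds since $s_i + t_i \le \alpha$ and $s_i \ge 0$
for every $i \in \{1, \ldots, M\}$, (b) follows from the constraint that
$s_i t_i = 0$ for every $i$, and (c) holds since the cardinality of
the support of $\{s_i\}$ is an integer, and $\lfloor M - \frac{1}{\alpha} \rfloor = M - \lceil \frac{1}{\alpha} \rceil$.
Hence,

$$-\sum_{i=1}^{M} s_i \log(s_i) \le \log\Bigl(M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil\Bigr)$$

and the solution of the optimization problem in (22) satisfies

$$g(\alpha) \le \log\Bigl(M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil\Bigr) + f(\alpha) \tag{25}$$

where $f(\alpha)$ solves the optimization problem

$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{M} t_i \log(t_i) \\
\text{subject to} \quad & 0 \le t_i \le \alpha, \quad \forall\, i \in \{1, \ldots, M\} \\
& \sum_{i=1}^{M} t_i = 1
\end{aligned} \tag{26}$$

with the $M$ optimization variables $t_1, \ldots, t_M$. Note that the
objective function in (26) is convex, and the feasible set is a
bounded polyhedron. Furthermore, the maximum of a convex
function over a bounded polyhedron is attained at one of its
vertices (see, e.g., [39, Corollary 32.3.3]; this property follows
from the convex-hull description of a bounded polyhedron
and Jensen's inequality). Since the objective function and
the feasible set in (26) are invariant to a permutation of the
variables $t_1, \ldots, t_M$, an optimal point is given by

$$t_1 = \ldots = t_l = \alpha, \quad t_{l+1} = 1 - \alpha l, \quad t_{l+2} = \ldots = t_M = 0, \quad l \triangleq \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor$$

where $l \le \frac{M}{2}$ (since $\alpha \in [\frac{2}{M}, 1]$), and indeed $t_i \in [0, \alpha]$ for
$i \in \{1, \ldots, M\}$. This implies that the solution of the optimization
problem in (26) is given by

$$f(\alpha) = \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor \log \alpha
+ \Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \log\Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr). \tag{27}$$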

From (25) and (27), it follows that the right-hand side of (23)
forms an upper bound on $g(\alpha)$. It remains to show that this
bound is tight. To this end, we distinguish between the following two
cases:

Case 1: Suppose that $N \triangleq \frac{1}{\alpha}$ is an integer. In this case, the
upper bound on $g(\alpha)$ (see the right-hand side of (23)) takes the
simplified form

$$g(\alpha) \le \log\Bigl(M - \frac{1}{\alpha}\Bigr) + \log \alpha = \log(M\alpha - 1).$$

This upper bound on $g(\alpha)$ is achieved by the point
$(s_1, t_1, \ldots, s_M, t_M)$ where

$$\begin{aligned}
& t_1 = \ldots = t_N = \alpha, \quad t_{N+1} = \ldots = t_M = 0 \\
& s_1 = \ldots = s_N = 0, \quad s_{N+1} = \ldots = s_M = \frac{1}{M - N}.
\end{aligned}$$

Note that this point is included in the feasible set of the
optimization problem in (22) since $\frac{1}{M - N} = \frac{\alpha}{M\alpha - 1} \le \alpha$, where
the last inequality holds because $\alpha \in [\frac{2}{M}, 1]$, and the value of
the objective function in (22) at this specific point is equal to

$$-\sum_{i=1}^{M} s_i \log(s_i) + \sum_{i=1}^{M} t_i \log(t_i)
= \log\Bigl(M - \frac{1}{\alpha}\Bigr) + \log \alpha = \log(M\alpha - 1)$$

so this upper bound on $g(\alpha)$ is tight if $\frac{1}{\alpha}$ is an integer.
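Case 2: Suppose that $\frac{1}{\alpha}$ is not an integer. In this case, let
$l \triangleq \lfloor \frac{1}{\alpha} \rfloor$, and consider the $(2M)$-dimensional
vector $(s_1, t_1, \ldots, s_M, t_M)$ where

$$\begin{aligned}
& t_1 = \ldots = t_l = \alpha, \quad t_{l+1} = 1 - \alpha l, \quad t_{l+2} = \ldots = t_M = 0 \\
& s_1 = \ldots = s_{l+1} = 0, \quad s_{l+2} = \ldots = s_M = \frac{1}{M - l - 1}.
\end{aligned} \tag{28}$$

To verify that it is included in the feasible set of (22), note
that due to the constraints of this optimization problem

$$\begin{aligned}
& 1 = \sum_{i=1}^{M} s_i \le \alpha\, \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \\
\Rightarrow\ & \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \ge \frac{1}{\alpha} \\
\Rightarrow\ & \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \ge \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil
\end{aligned}$$

and, by combining it with (24), it follows that

$$\Bigl\lceil \frac{1}{\alpha} \Bigr\rceil \le \bigl|\{i \in \{1, \ldots, M\} : s_i > 0\}\bigr| \le M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil$$

so $\lceil \frac{1}{\alpha} \rceil \le \frac{M}{2}$. Since $l + 1 = \lceil \frac{1}{\alpha} \rceil$ in this case, this implies
that for $j \in \{l + 2, \ldots, M\}$ (note also that $\alpha \in [\frac{2}{M}, 1]$)

$$s_j = \frac{1}{M - l - 1} \le \frac{2}{M} \le \alpha$$

and $t_{l+1} = 1 - \alpha l < \alpha$, so the vector is indeed included
in the feasible set of (22). The value of the objective function
in (22) at the selected point in (28) is equal to

$$\begin{aligned}
& -\sum_{i=1}^{M} s_i \log(s_i) + \sum_{i=1}^{M} t_i \log(t_i) \\
& = \log\Bigl(M - \Bigl\lceil \frac{1}{\alpha} \Bigr\rceil\Bigr)
+ \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor \log \alpha
+ \Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \log\Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr)
\end{aligned}$$

which coincides with the right-hand side of (23),
so the upper bound on $g(\alpha)$ from (25) and (27) is tight. This
completes the proof of Lemma 1. ■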

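Corollary 1: The solution of the non-convex optimization
problem in (22) satisfies the inequality

$$g(\alpha) \le \log(M\alpha - 1)$$

and this bound is tight if and only if $\frac{1}{\alpha}$ is an integer.

Proof: From Lemma 1 (see Eq. (23)), and since
$\lceil \frac{1}{\alpha} \rceil \ge \frac{1}{\alpha}$ and $1 - \alpha \lfloor \frac{1}{\alpha} \rfloor < \alpha$, it follows that

$$\begin{aligned}
g(\alpha)
&\le \log\Bigl(M - \frac{1}{\alpha}\Bigr)
+ \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor \log \alpha
+ \Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \log\Bigl(1 - \alpha \Bigl\lfloor \frac{1}{\alpha} \Bigr\rfloor\Bigr) \\
&\le \log\Bigl(M - \frac{1}{\alpha}\Bigr) + \log(\alpha) \\
&= \log(M\alpha - 1)
\end{aligned}$$

and the above inequalities hold with equality if and only if
$\frac{1}{\alpha}$ is an integer. ■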
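
Lemma 1 and Corollary 1 can be sanity-checked numerically. The sketch below does not solve the non-convex problem (22); it only evaluates the closed form (23), builds the extremal point used in the tightness part of the proof, verifies its feasibility, and compares the result with $\log(M\alpha - 1)$. All function names and the tested pairs $(M, \alpha)$ are illustrative assumptions; $\alpha$ is passed as a `Fraction` so that the floor, the ceiling and the constraints are handled exactly.

```python
import math
from fractions import Fraction


def xlogx(x):
    """x * log(x) in nats, with the convention 0 log 0 = 0."""
    x = float(x)
    return 0.0 if x == 0.0 else x * math.log(x)


def g_closed_form(M, alpha):
    """Right-hand side of (23); alpha is a Fraction in [2/M, 1]."""
    lo, hi = math.floor(1 / alpha), math.ceil(1 / alpha)
    return math.log(M - hi) + lo * xlogx(alpha) + xlogx(1 - lo * alpha)


def extremal_point(M, alpha):
    """The point used in Cases 1 and 2 of the proof of Lemma 1."""
    lo = math.floor(1 / alpha)
    rest = 1 - lo * alpha
    t = [alpha] * lo + ([rest] if rest > 0 else [])
    k = len(t)  # k equals ceil(1 / alpha)
    t += [Fraction(0)] * (M - k)
    s = [Fraction(0)] * k + [Fraction(1, M - k)] * (M - k)
    return s, t


def feasible(s, t, alpha):
    """Exact check of the constraints in (22)."""
    return (sum(s) == 1 and sum(t) == 1 and
            all(a >= 0 and b >= 0 and a * b == 0 and a + b <= alpha
                for a, b in zip(s, t)))


def objective(s, t):
    return -sum(xlogx(a) for a in s) + sum(xlogx(b) for b in t)


cases = [(10, Fraction(1, 4)), (10, Fraction(3, 10)), (7, Fraction(2, 5))]
results = []
for M, alpha in cases:
    s, t = extremal_point(M, alpha)
    results.append((feasible(s, t, alpha), objective(s, t),
                    g_closed_form(M, alpha), math.log(float(M * alpha - 1))))
```

For $\alpha = 1/4$ (so $1/\alpha$ is an integer) the closed form equals $\log(M\alpha - 1)$, whereas for the two non-integer cases it is strictly smaller, in line with Corollary 1.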
By combining (21) and Corollary 1, it follows that

$$H(V) - H(W) \le \log(M\alpha - 1)$$

and therefore, from (18),

$$|H(X) - H(Y)| \le d_{\mathrm{TV}}(X, Y) \log(M\alpha - 1) + I(J; \hat{X}) - I(J; \hat{Y}). \tag{29}$$

Finally, the bound in (11) follows from the inequality

$$I(J; \hat{X}) - I(J; \hat{Y}) \le H(J) = h\bigl(d_{\mathrm{TV}}(X, Y)\bigr). \tag{30}$$

We move to derive a refinement of the bound in (11) when
$\frac{1}{2} \le \frac{P_X}{P_Y} \le 2$. In this case, the starting point is the inequality
in (29), and the aim is to improve the upper bound in (30).
To this end,

$$\begin{aligned}
I(J; \hat{X}) - I(J; \hat{Y})
&= H(J | \hat{Y}) - H(J | \hat{X}) \\
&\le H(J) - H(J | \hat{X}) \\
&= h\bigl(d_{\mathrm{TV}}(X, Y)\bigr) - H(J | \hat{X})
\end{aligned} \tag{31}$$

and, from [22, Theorem 11],

$$H(J | \hat{X}) \ge 2 \log 2 \cdot \mathbb{P}\bigl(J \ne \hat{J}_{\mathrm{MAP}}(\hat{X})\bigr) \tag{32}$$

where $\hat{J}_{\mathrm{MAP}}(\hat{X})$ is the maximum a-posteriori (MAP) estimator
of $J$ based on $\hat{X}$ (note that the minimum on the left-hand side
of [22, Eq. (110)] is achieved by the MAP estimator). In the
following, the estimator $\hat{J}_{\mathrm{MAP}}(\hat{X})$ on the right-hand side of
(32) is calculated.

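1) If $\hat{X} \notin \text{supp}(P_V)$ then a.s. $J = 1$ (otherwise, $J = 0$ and
$\hat{X} = V$, so $\hat{X} \in \text{supp}(P_V)$ a.s.). Hence,

$$\hat{X} \notin \text{supp}(P_V) \ \Rightarrow\ \hat{J}_{\mathrm{MAP}}(\hat{X}) = 1.$$

From (5), it follows that $\hat{X} \notin \text{supp}(P_V)$ if and only if
$P_X(\hat{X}) \le P_Y(\hat{X})$.

2) If $\hat{X} \in \text{supp}(P_V)$ then, from (5), $P_X(\hat{X}) > P_Y(\hat{X})$.
Hence, from (4) and (5) with $p = 1 - d_{\mathrm{TV}}(X, Y)$,

$$P_U(\hat{X}) = \frac{P_Y(\hat{X})}{1 - d_{\mathrm{TV}}(X, Y)}, \quad
P_V(\hat{X}) = \frac{P_X(\hat{X}) - P_Y(\hat{X})}{d_{\mathrm{TV}}(X, Y)}.$$

Since $U, V, J$ are independent, then from (3)

$$\begin{aligned}
\mathbb{P}(J = 1, \hat{X}) &= \mathbb{P}(J = 1)\, P_U(\hat{X}) = P_Y(\hat{X}) \\
\mathbb{P}(J = 0, \hat{X}) &= \mathbb{P}(J = 0)\, P_V(\hat{X}) = P_X(\hat{X}) - P_Y(\hat{X})
\end{aligned}$$

so, if $\hat{X} \in \text{supp}(P_V)$, then

$$\hat{J}_{\mathrm{MAP}}(\hat{X}) = \begin{cases}
1 & \text{if } \frac{P_X(\hat{X})}{2} \le P_Y(\hat{X}) < P_X(\hat{X}) \\
0 & \text{if } P_Y(\hat{X}) < \frac{P_X(\hat{X})}{2}.
\end{cases}$$

To conclude, the MAP estimator of $J$ that is based on the
observation $\hat{X}$ is given by

$$\hat{J}_{\mathrm{MAP}}(\hat{X}) = \begin{cases}
1 & \text{if } \frac{P_X(\hat{X})}{2} \le P_Y(\hat{X}) \\
0 & \text{if } P_Y(\hat{X}) < \frac{P_X(\hat{X})}{2}.
\end{cases}$$

It therefore implies that if $\frac{P_Y}{P_X} \ge \frac{1}{2}$ whenever $P_X > 0$, then
$\hat{J}_{\mathrm{MAP}}(\hat{X}) = 1$ independently of $\hat{X}$, so in this case

$$\mathbb{P}\bigl(J \ne \hat{J}_{\mathrm{MAP}}(\hat{X})\bigr) = \mathbb{P}(J = 0) = d_{\mathrm{TV}}(X, Y).$$

Hence, from (31), (32) and the last equality, if $\frac{P_Y}{P_X} \ge \frac{1}{2}$
whenever $P_X > 0$ then

$$I(J; \hat{X}) - I(J; \hat{Y}) \le h\bigl(d_{\mathrm{TV}}(X, Y)\bigr) - 2 \log 2 \cdot d_{\mathrm{TV}}(X, Y).$$

A combination of the last inequality with (29) finally gives
the refined bound in (13). Since it was assumed at the
beginning of the proof that $H(X) \ge H(Y)$ while it is not
necessarily known in advance which entropy is larger, the
requirement on $\frac{P_Y}{P_X}$ can be symmetrized by requiring that
$\frac{1}{2} \le \frac{P_X}{P_Y} \le 2$ whenever $P_X, P_Y > 0$. This completes the
proof of Theorem 4. ■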
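
The refinement step can be traced numerically for a concrete pair of distributions. The sketch below relies on the joint distribution of $(J, \hat{X})$ implied by the construction in the proof of Theorem 1, namely $\mathbb{P}(J = 1, \hat{X} = x) = \min\{P_X(x), P_Y(x)\}$ and $\mathbb{P}(J = 0, \hat{X} = x) = P_X(x) - \min\{P_X(x), P_Y(x)\}$, and similarly for $\hat{Y}$. The function names and the example distributions are illustrative assumptions.

```python
import math


def h(x):
    """Binary entropy in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def refinement_terms(px, py):
    """Quantities appearing in (31) and (32) for the coupling of Theorem 1."""
    low = [min(a, b) for a, b in zip(px, py)]  # P(J = 1, X_hat = x)
    d_tv = 1.0 - sum(low)                      # P(J = 0), see (10)
    # H(J | X_hat) = sum_x P_X(x) * h(P(J = 1 | X_hat = x))
    h_j_x = sum(a * h(m / a) for a, m in zip(px, low) if a > 0)
    h_j_y = sum(b * h(m / b) for b, m in zip(py, low) if b > 0)
    # Error probability of the MAP estimate of J from X_hat.
    map_error = sum(min(m, a - m) for a, m in zip(px, low))
    return d_tv, h_j_x, h_j_y, map_error


# Example satisfying 1/2 <= P_X / P_Y <= 2 on every symbol.
px = [0.3, 0.2, 0.3, 0.2]
py = [0.25, 0.25, 0.25, 0.25]
d_tv, h_j_x, h_j_y, map_error = refinement_terms(px, py)

lhs = h_j_y - h_j_x                          # I(J; X_hat) - I(J; Y_hat)
rhs = h(d_tv) - 2.0 * math.log(2.0) * d_tv   # refined replacement for (30)
```

In this example the MAP error probability equals `d_tv`, the inequality (32) holds with room to spare, and `lhs` does not exceed `rhs`.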

Corollary 2: Let $X$ and $Y$ be two discrete random variables
that take values in a set $\mathcal{A}$, and let $|\mathcal{A}| = M$. Assume that for
some positive constants $\varepsilon_1, \varepsilon_2$

$$d_{\mathrm{TV}}(X, Y) \le \varepsilon_1 \le 1 - \frac{1}{M \varepsilon_2} \tag{33}$$

$$\frac{d_{\mathrm{loc}}(X, Y)}{d_{\mathrm{TV}}(X, Y)} \le \varepsilon_2 \le 1. \tag{34}$$

Then,

$$|H(X) - H(Y)| \le \varepsilon_1 \log(M \varepsilon_2 - 1) + h(\varepsilon_1). \tag{35}$$

Proof: From (11), (12), (34), and since $\alpha \le \varepsilon_2$,

$$|H(X) - H(Y)| \le d_{\mathrm{TV}}(X, Y) \log(M \varepsilon_2 - 1) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr).$$

The function $q(\varepsilon) = \varepsilon c + h(\varepsilon)$ is monotonically increasing over
the interval $\bigl[0, \frac{e^c}{1 + e^c}\bigr]$ ($q'(\varepsilon) = c + \log\bigl(\frac{1 - \varepsilon}{\varepsilon}\bigr) > 0$ if and only
if $0 < \varepsilon < \frac{e^c}{1 + e^c}$). Referring to the right-hand side of the
above inequality, let $c = \log(M \varepsilon_2 - 1)$, so $\frac{e^c}{1 + e^c} = 1 - \frac{1}{M \varepsilon_2}$.
Hence, if the conditions in (33) and (34) are satisfied then the
inequality in (35) holds. ■

Remark 4: By considering the pair of probability mass
functions $P_{X,Y}$ and $P_X \times P_Y$ (with a slight abuse of notation, let
$H(P_X) = H(X)$), then

$$\begin{aligned}
H(P_X \times P_Y) - H(P_{X,Y})
&= H(X) + H(Y) - H(X, Y) \\
&= I(X; Y).
\end{aligned}$$

Hence, Theorem 4 and Corollary 2 provide bounds on the
mutual information between two discrete random variables of
finite support, where these bounds are expressed in terms of the
local and total variation distances between the joint distribution
of $(X, Y)$ and the product of its marginal distributions. The
specialization of Theorem 4 to this setting tightens the bound
in [49, Theorem 1], and the former bound is particularized to
the latter known bound in the case where the local and total
variation distances are equal (which is the extreme case).

Remark 5: The bound in [49, Theorem 1] was improved
in [34, Proposition 1] without any further assumptions. It is
noted that by introducing the additional requirement that
there exists some constant $\varepsilon_2 \in [0, 1]$ such that for every $y \in \mathcal{Y}$

$$\frac{d_{\mathrm{loc}}(P_X, P_{X|Y=y})}{d_{\mathrm{TV}}(P_X, P_{X|Y=y})} \le \varepsilon_2$$

it is possible to refine the bound in [34, Proposition 1]. This
follows by combining the proof of [34, Proposition 1] with
(35) (see Corollary 2), where Eq. (35) replaces the use of [49,
Eq. (4)] in [34, Eq. (35)]. The same also applies to [35,
Proposition 2], referring to its proof in [35, p. 305].

We proceed to consider the entropy difference of discrete
random variables in the case of a countably infinite alphabet.

Theorem 5: Let $\mathcal{A} = \{a_1, a_2, \ldots\}$ be a countably infinite
set. Let $X$ and $Y$ be discrete random variables where $X$ takes
values in the set $\mathcal{X} = \{a_1, \ldots, a_m\}$ for some $m \in \mathbb{N}$, and $Y$
takes values in the set $\mathcal{A}$. Assume that for some $\eta_1, \eta_2, \eta_3 > 0$,
the local and total variation distances between $X$ and $Y$ satisfy

$$\eta_2 \le d_{\mathrm{TV}}(X, Y) \le \eta_1, \quad d_{\mathrm{loc}}(X, Y) \le \eta_3 \tag{36}$$

where $\eta_3 \le \eta_2$. Let $M$ be an integer such that

$$\sum_{i=M}^{\infty} P_Y(a_i) \le \eta_3, \quad M \ge \max\Bigl\{m + 1, \frac{\eta_2}{(1 - \eta_1)\, \eta_3}\Bigr\} \tag{37}$$

and let $\eta_4 > 0$ satisfy

$$-\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) \le \eta_4. \tag{38}$$

Then, the following inequality holds:

$$|H(X) - H(Y)| \le \eta_1 \log\Bigl(\frac{M \eta_3}{\eta_2} - 1\Bigr) + h(\eta_1) + \eta_4. \tag{39}$$



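Proof: Let $\tilde{Y}$ be a random variable that is defined to be
equal to $Y$ if $Y \in \{a_1, \ldots, a_{M-1}\}$, and it is set to be equal to
$a_M$ if $Y = a_i$ for some $i \ge M$. Hence, the probability mass
function of $\tilde{Y}$ is related to that of $Y$ as follows:

$$P_{\tilde{Y}}(a_i) = \begin{cases}
P_Y(a_i) & \text{if } i \in \{1, \ldots, M - 1\} \\
\sum_{j=M}^{\infty} P_Y(a_j) & \text{if } i = M.
\end{cases} \tag{40}$$

Since $P_X(a_i) = 0$ for every $i > m$ and also $M \ge m + 1$ (see
the second inequality in (37)), it follows from (40) that

$$\begin{aligned}
d_{\mathrm{TV}}(X, \tilde{Y})
&= \frac{1}{2} \sum_{i=1}^{m} |P_X(a_i) - P_{\tilde{Y}}(a_i)| + \frac{1}{2} \sum_{i=m+1}^{M} P_{\tilde{Y}}(a_i) \\
&= \frac{1}{2} \sum_{i=1}^{m} |P_X(a_i) - P_Y(a_i)| + \frac{1}{2} \sum_{i=m+1}^{\infty} P_Y(a_i) \\
&= d_{\mathrm{TV}}(X, Y).
\end{aligned} \tag{41}$$

Hence, $X$ and $\tilde{Y}$ are discrete random variables that take values
in the set $\{a_1, \ldots, a_M\}$ (note that it includes the set $\mathcal{X}$), and
from (36) and (41)

$$0 < \eta_2 \le d_{\mathrm{TV}}(X, \tilde{Y}) \le \eta_1. \tag{42}$$

Furthermore, the local distance between $X$ and $\tilde{Y}$ satisfies

$$\begin{aligned}
d_{\mathrm{loc}}(X, \tilde{Y})
&= \max_{i \in \{1, \ldots, M\}} |P_X(a_i) - P_{\tilde{Y}}(a_i)| \\
&\overset{(a)}{=} \max\Bigl\{ \max_{i \in \{1, \ldots, M-1\}} |P_X(a_i) - P_Y(a_i)|,\ \sum_{i=M}^{\infty} P_Y(a_i) \Bigr\} \\
&\overset{(b)}{\le} \max\{d_{\mathrm{loc}}(X, Y), \eta_3\} \\
&\overset{(c)}{=} \eta_3
\end{aligned} \tag{43}$$

where (a), (b) and (c) above follow from the equality in (40),
the first inequality in (37) and the second inequality in (36),
respectively. From (42) and (43)

$$d_{\mathrm{TV}}(X, \tilde{Y}) \le \eta_1 \triangleq \varepsilon_1 \tag{44}$$

$$\frac{d_{\mathrm{loc}}(X, \tilde{Y})}{d_{\mathrm{TV}}(X, \tilde{Y})} \le \frac{\eta_3}{\eta_2} \triangleq \varepsilon_2 \tag{45}$$

where $0 < \varepsilon_2 \le 1$ (since, by assumption, $0 < \eta_3 \le \eta_2$). The
integer $M$ is set to satisfy the inequality $M \ge \frac{\eta_2}{(1 - \eta_1)\, \eta_3}$ (see
the second inequality in (37)), so from (44) and (45)

$$\varepsilon_1 \le 1 - \frac{1}{M \varepsilon_2}.$$

Hence, it follows from Theorem 4 (in the form of Corollary 2) that

$$|H(X) - H(\tilde{Y})| \le \eta_1 \log\Bigl(\frac{M \eta_3}{\eta_2} - 1\Bigr) + h(\eta_1). \tag{46}$$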



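Since $\tilde{Y}$ is a deterministic function of $Y$ then $H(Y) \ge H(\tilde{Y})$,
and from (40)

$$\begin{aligned}
|H(Y) - H(\tilde{Y})|
&= H(Y) - H(\tilde{Y}) \\
&= -\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) + \Bigl(\sum_{i=M}^{\infty} P_Y(a_i)\Bigr) \log\Bigl(\sum_{i=M}^{\infty} P_Y(a_i)\Bigr) \\
&\le -\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) \le \eta_4.
\end{aligned} \tag{47}$$

Finally, the bound in this theorem follows from (46), (47) and
the triangle inequality. ■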
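
The truncation argument in the proof of Theorem 5 can be illustrated numerically. The sketch below uses a geometric-type distribution $P_Y(a_i) = 2^{-i}$ as $Y$, represented by a long finite list as a stand-in for the countably infinite alphabet (an assumption of the sketch, with the negligible remaining tail mass placed on the last entry). The function names, the choice $M = 6$ and the distribution of $X$ are illustrative; $\eta_1 = \eta_2 = d_{\mathrm{TV}}(X, Y)$ and $\eta_3 = d_{\mathrm{loc}}(X, Y)$ are chosen so that (36) holds.

```python
import math


def h(x):
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def pad(p, n):
    return list(p) + [0.0] * (n - len(p))


def total_variation(px, py):
    n = max(len(px), len(py))
    return 0.5 * sum(abs(a - b) for a, b in zip(pad(px, n), pad(py, n)))


def local_distance(px, py):
    n = max(len(px), len(py))
    return max(abs(a - b) for a, b in zip(pad(px, n), pad(py, n)))


def merge_tail(py, M):
    """pmf of Y-tilde in (40): symbols a_M, a_{M+1}, ... are merged into a_M."""
    return list(py[:M - 1]) + [sum(py[M - 1:])]


# Finite stand-in for P_Y(a_i) = 2^{-i}, i = 1, 2, ...
py = [2.0 ** -i for i in range(1, 60)] + [2.0 ** -59]
px = [0.5, 0.3, 0.2]          # X lives on the first m = 3 symbols
m, M = 3, 6

py_tilde = merge_tail(py, M)
d_tv = total_variation(px, py)
d_loc = local_distance(px, py)
eta1 = eta2 = d_tv
eta3 = d_loc
tail_mass = sum(py[M - 1:])
eta4 = -sum(q * math.log(q) for q in py[M - 1:] if q > 0)   # see (38)

bound39 = eta1 * math.log(M * eta3 / eta2 - 1.0) + h(eta1) + eta4
gap = abs(entropy(px) - entropy(py))
merge_loss = entropy(py) - entropy(py_tilde)                # see (47)
```

In this example the total variation distance is unchanged by the merging, as in (41); the entropy lost by merging lies between zero and `eta4`, as in (47); and the entropy difference of about 0.36 nats is below the value of about 0.65 nats given by (39).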
Corollary 3: In the setting of $X$ and $Y$ in Theorem 5,
assume that $d_{\mathrm{TV}}(X, Y) \le \eta$ for some $\eta \in (0, 1)$. Let
$M \triangleq \max\bigl\{m + 1, \frac{1}{1 - \eta}\bigr\}$, and assume that for some $\mu > 0$

$$-\sum_{i=M}^{\infty} P_Y(a_i) \log P_Y(a_i) \le \mu;$$

then $|H(X) - H(Y)| \le \eta \log(M - 1) + h(\eta) + \mu$.

Proof: This corollary follows from Theorem 5 by setting
$\eta_2 = \eta_3 = d_{\mathrm{loc}}(X, Y)$ (note that $d_{\mathrm{loc}}(X, Y) \le d_{\mathrm{TV}}(X, Y)$),
and then $\eta_1$ and $\eta_4$ are replaced by $\eta$ and $\mu$, respectively. ■

Remark 6: The result in Corollary 3 coincides with [42,
Theorem 4], which gives a bound on the entropy difference in
terms of the total variation distance by relying on the bound
in [49, Eq. (4)] or [21, Theorem 6].



IV. Examples 

In the following, we exemplify the use of the new bounds in
Section III and also compare them with some existing bounds.

Example 1: Let $X$ be a discrete random variable that takes
values in the set $\mathcal{A} = \{a_1, \ldots, a_M\}$. Let us express its arbitrary
probability mass function in the form

$$P_X(a_i) = \frac{1 + u_i \xi_i}{M}, \quad \forall\, i \in \{1, \ldots, M\} \tag{48}$$

where

$$u_i \in \{-1, 1\}, \quad \xi_i \ge 0, \quad 0 \le 1 + u_i \xi_i \le M, \quad \forall\, i \in \{1, \ldots, M\}$$

and

$$\sum_{i=1}^{M} u_i \xi_i = 0$$

where the latter equality is equivalent to $\sum_{i=1}^{M} P_X(a_i) = 1$.

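In the following, we derive a lower bound on the entropy
$H(X)$. Let $Y$ be a random variable that takes the values from
$\mathcal{A}$ with equal probability, so $H(Y) = \log M$. The local and
total variation distances between $X$ and $Y$ are equal to

$$d_{\mathrm{TV}}(X, Y) = \frac{1}{2M} \sum_{i=1}^{M} \xi_i = \frac{\xi_{\mathrm{avg}}^{(M)}}{2}, \quad
d_{\mathrm{loc}}(X, Y) = \frac{1}{M} \max_{1 \le i \le M} \xi_i = \frac{\xi_{\max}^{(M)}}{M}$$

where $\xi_{\mathrm{avg}}^{(M)}$ and $\xi_{\max}^{(M)}$ denote the average and maximal values
of $\{\xi_i\}_{i=1}^{M}$, respectively. From (12),

$$\alpha = \frac{d_{\mathrm{loc}}(X, Y)}{d_{\mathrm{TV}}(X, Y)} = \frac{2\, \xi_{\max}^{(M)}}{M\, \xi_{\mathrm{avg}}^{(M)}}$$

so $\alpha = \frac{2 K_M}{M}$ where

$$K_M \triangleq \frac{\xi_{\max}^{(M)}}{\xi_{\mathrm{avg}}^{(M)}}. \tag{49}$$

From (11) (where also $H(Y) = \log M \ge H(X)$), it follows
that

$$\log M - \frac{\xi_{\mathrm{avg}}^{(M)}}{2} \log(2 K_M - 1) - h\Bigl(\frac{\xi_{\mathrm{avg}}^{(M)}}{2}\Bigr) \le H(X) \le \log M$$

and, since the binary entropy function is bounded between
0 and $\log 2$, the above inequality can be loosened to

$$1 - \frac{\xi_{\mathrm{avg}}^{(M)} \log(2 K_M - 1)}{2 \log M} - \frac{\log 2}{\log M} \le \frac{H(X)}{\log M} \le 1$$

which implies (since $K_M \ge 1$) that

$$\lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log K_M}{\log M} = 0 \tag{50}$$

$$\Rightarrow\ \lim_{M \to \infty} \frac{H(X)}{\log M} = 1. \tag{51}$$

For comparison, the bound in Theorem 3 gives that

$$1 - \frac{\xi_{\mathrm{avg}}^{(M)} \log(M - 1)}{2 \log M} - \frac{1}{\log M}\, h\Bigl(\frac{\xi_{\mathrm{avg}}^{(M)}}{2}\Bigr) \le \frac{H(X)}{\log M} \le 1 \tag{52}$$

which implies that

$$\lim_{M \to \infty} \xi_{\mathrm{avg}}^{(M)} = 0 \ \Rightarrow\ \lim_{M \to \infty} \frac{H(X)}{\log M} = 1. \tag{53}$$

The latter condition in (53) is strictly stronger than (50). To
see this, note that $1 \le K_M \le \frac{M}{2}$ (equivalently, by (49),
$\frac{2}{M} \le \alpha \le 1$), so $0 \le \frac{\log K_M}{\log M} \le 1$ and the condition in (53)
implies the condition in (50).

On the other hand, as a concrete example for the case where
the condition in (50) holds whereas the condition in (53) does
not hold, let $M$ be an arbitrary even number, and

$$u_i = (-1)^i, \quad \xi_i = \beta \in (0, 1], \quad \forall\, i \in \{1, \ldots, M\}$$

where, indeed, $\sum_{i=1}^{M} u_i \xi_i = \beta \sum_{i=1}^{M} (-1)^i = 0$. In this
case, $P_X(a_i) = \frac{1 - \beta}{M}$ for odd numbers $i \in \{1, \ldots, M\}$, and
$P_X(a_i) = \frac{1 + \beta}{M}$ for even numbers $i$. Furthermore, in this
case $K_M = 1$ for every even $M$, so the condition in (50)
holds by letting the even number $M$ tend to infinity. On
the other hand, the condition in (53) is not satisfied since
$\lim_{M \to \infty} \xi_{\mathrm{avg}}^{(M)} = \beta > 0$. The upper and lower bounds in (52)
tend to 1 and $1 - \frac{\beta}{2}$, respectively, so the gap between these
asymptotic bounds increases linearly with $\beta$. Therefore,
Theorem 4 gives a simple lower bound on the entropy $H(X)$
in terms of the average and maximal values of $\{\xi_i\}_{i=1}^{M}$, which
improves the lower bound on the entropy that follows from
the known bound in Theorem 3 (see (52)).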

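For comparison, the bound in [21, Theorem 7] is also
applied to this example. In this case, since $P_X, P_Y \le \frac{1 + \xi_{\max}^{(M)}}{M}$,
$P_X$ and $P_Y$ are less than or equal to $\frac{1}{N_M}$ with
$N_M \triangleq \Bigl\lfloor \frac{M}{1 + \xi_{\max}^{(M)}} \Bigr\rfloor$. Similarly to the above analysis, it is easy to verify
from [21, Theorem 7] that

$$\lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\max}^{(M)}\bigr)}{\log M} = 0
\ \Rightarrow\ \lim_{M \to \infty} \frac{H(X)}{\log M} = 1. \tag{54}$$

Since

$$\frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\max}^{(M)}\bigr)}{\log M}
\ge \frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\mathrm{avg}}^{(M)}\bigr)}{\log M}
\ge -\frac{\log e}{e \log M}$$

where the right-hand side of this inequality holds since the
function $f(x) = x \log x$ for $x > 0$ achieves its minimal value
at $x = \frac{1}{e}$, it follows that if the limit on the left-hand side of
(54) is zero then also

$$\lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\mathrm{avg}}^{(M)}\bigr)}{\log M} = 0.$$

Therefore, the definition of $K_M$ in (49) gives that

$$\lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log K_M}{\log M}
= \lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\max}^{(M)}\bigr)}{\log M}
- \lim_{M \to \infty} \frac{\xi_{\mathrm{avg}}^{(M)} \log\bigl(\xi_{\mathrm{avg}}^{(M)}\bigr)}{\log M} = 0.$$

This shows that the condition in (54) implies the condition in (50),
so the result in (50)-(51) implies the one in (54).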

A special case of (48) with numerical results: As a special
case of the probability mass function in (48), let $M = 2^m$ for
some $m \in \mathbb{N}$, let $u_i = (-1)^i$ for $i \in \{1, \ldots, M\}$, and $\xi_i = \beta$
for some $\beta \in [0, 1]$. In this special case,

$$P_X(a_i) = \begin{cases}
2^{-m} (1 - \beta) & \text{if } i \in \{1, 3, \ldots, 2^m - 1\} \\
2^{-m} (1 + \beta) & \text{if } i \in \{2, 4, \ldots, 2^m\}.
\end{cases}$$

Let $Y$ be a random variable that takes all the values in the
set $\{a_1, \ldots, a_M\}$ with equal probability (i.e., $2^{-m}$). Then, the
local and total variation distances between $X$ and $Y$ are

$$d_{\mathrm{loc}}(X, Y) = 2^{-m} \beta, \quad d_{\mathrm{TV}}(X, Y) = \frac{\beta}{2}$$

so, from (12), $\alpha = 2^{1-m}$. The entropies of $X$ and $Y$ are

$$H(X) = (m - 1) \log 2 + h\Bigl(\frac{1 - \beta}{2}\Bigr), \quad H(Y) = m \log 2$$

so, $H(Y) - H(X) = \log 2 - h\bigl(\frac{1 - \beta}{2}\bigr)$ independently of $m$.

For comparison, the known bound in Theorem 3 that only
depends on the total variation distance between $X$ and $Y$ (with
no further knowledge about their probability mass functions)
gives

$$H(Y) - H(X) \le \frac{m \beta}{2} \log 2 + \frac{\beta}{2} \log\bigl(1 - 2^{-m}\bigr) + h\Bigl(\frac{\beta}{2}\Bigr)$$

so this upper bound increases almost linearly with $m$, in
contrast to the exact value that is independent of $m$. The
new bound in (11), which depends on both the local and total
variation distances between $X$ and $Y$ (but again, without any
further information on their probability mass functions) gives

$$\begin{aligned}
H(Y) - H(X)
&\le d_{\mathrm{TV}}(X, Y) \log(M\alpha - 1) + h\bigl(d_{\mathrm{TV}}(X, Y)\bigr) \\
&= h\Bigl(\frac{\beta}{2}\Bigr).
\end{aligned} \tag{55}$$

Similarly to the exact value, but in contrast to the former
bound, the latter bound is independent of $m$. Furthermore, if
$\beta \to 0$ and $m\beta \to \infty$, then the exact value of $H(Y) - H(X)$
as well as the latter bound (that follows from Theorem 4)
tend to zero, whereas the former bound that follows from
Theorem 3 tends to infinity. This shows the difference between
the two bounds, exemplifying the possible advantage of taking
into account the local distance in addition to the total variation
distance.

For $\beta \in [0, \tfrac{1}{2}]$, the condition $\tfrac{1}{2} \le \frac{P_X}{P_Y} \le 2$ is fulfilled, so the tightened bound in (13) gives that

$$0 \le H(Y) - H(X) \le h\!\left(\frac{\beta}{2}\right) - \beta \log 2. \tag{56}$$

If $\beta = \tfrac{1}{2}$, $H(Y) - H(X) = \log 2 - h(\tfrac{1}{4}) = 0.131$ nats, the upper bound in (55) is equal to 0.562 nats, and the tightened version of this bound in (56) is equal to 0.216 nats.
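The three figures quoted above can be reproduced with a few lines of Python. The sketch below is only a numerical check under stated assumptions: the bound that depends solely on the total variation distance is taken in the form $d_{\mathrm{TV}} \log(M-1) + h(d_{\mathrm{TV}})$, the bound that also uses the local distance reduces to $h(\beta/2)$ because $M\alpha = 2$ here, and the tightened bound is taken as $h(\beta/2) - \beta \log 2$. The function names are illustrative.

```python
import math

def h(x):
    """Binary entropy function in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def exact_gap(beta):
    """Exact value of H(Y) - H(X); it does not depend on m."""
    return math.log(2) - h((1 - beta) / 2)

def tv_only_bound(m, beta):
    """Assumed form d_TV * log(M - 1) + h(d_TV) with M = 2^m and d_TV = beta / 2."""
    return (beta / 2) * math.log(2 ** m - 1) + h(beta / 2)

def local_tv_bound(beta):
    """d_TV * log(M * alpha - 1) + h(d_TV); here M * alpha = 2, so the log term vanishes."""
    return h(beta / 2)

def tightened_bound(beta):
    """Assumed tightened form, used only for beta in [0, 1/2]."""
    return h(beta / 2) - beta * math.log(2)

beta = 0.5
for m in (4, 16, 64):
    print(m,
          round(exact_gap(beta), 3),
          round(tv_only_bound(m, beta), 3),
          round(local_tv_bound(beta), 3),
          round(tightened_bound(beta), 3))
```

Only the second printed bound changes with `m`, growing roughly linearly in it, while the other three columns stay fixed.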

It is noted that since $P_X$ is majorized by $P_Y$ (see [22, Definition 1 on p. 5934]), then according to [22, Theorem 3]

$$H(Y) - H(X) \ge D(P_X \| P_Y)$$

and since $P_Y$ refers to a uniform distribution over a set of cardinality $M = 2^m$ then $H(Y) = m \log 2$, and

$$D(P_X \| P_Y) = m \log 2 - H(X)$$

so, the above lower bound is achieved here with equality.
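For completeness, the expression for the divergence can be verified in one line; the only fact used is that $P_Y(a_i) = 2^{-m}$ for every $i$:

```latex
\begin{align*}
D(P_X \| P_Y)
  &= \sum_{i=1}^{M} P_X(a_i) \log \frac{P_X(a_i)}{2^{-m}} \\
  &= m \log 2 \sum_{i=1}^{M} P_X(a_i) + \sum_{i=1}^{M} P_X(a_i) \log P_X(a_i) \\
  &= m \log 2 - H(X) = H(Y) - H(X).
\end{align*}
```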

In Example 1, the probability mass function of the discrete random variable $X$ was known explicitly. However, in many interesting applications, this is not necessarily the case. If the exact distribution of $X$ is not available or is numerically hard to compute, good bounds on the local and total variation distances between $X$ and another random variable $Y$ with a known probability mass function can be valuable for obtaining a rigorous bound on the difference $|H(X) - H(Y)|$ via Theorems 4 or 5. Such a bound on the entropy difference provides bounds on the entropy of $X$ in terms of another entropy (the entropy of $Y$), which is assumed to be easily calculable. For example, assume that $X = \sum_{i=1}^n X_i$ is expressed as a sum of Bernoulli random variables that are either independent or weakly dependent, and may also be non-identically distributed. Let $X_i \sim \mathrm{Bernoulli}(p_i)$, and assume that $\sum_{i=1}^n p_i = \lambda$ where all of the $p_i$'s are much smaller than 1. In this case, the approximation of $X$ by a Poisson distribution with mean $\lambda$ (according to the law of small numbers [24]) raises the question: How close is $H(X)$ to the entropy of the Poisson distribution with mean $\lambda$? (note that the latter entropy of the Poisson distribution is calculated efficiently
in 0). This question is especially interesting because the 
support of the Poisson distribution is the infinite countable set 
of non-negative integers, and the entropy is known not to be 
continuous when the support is not finite; hence, a small total 
variation distance does not in general yield a small difference 
between the two entropies. This question was addressed in [42, 
Section 2] via the use of Corollary [3] (which coincides with 
[42, Theorem 4]), combined with an upper bound on the total 
variation distance between X and Y where the latter bound 
is calculated via the use of the Chen-Stein method (see, e.g., 
[40, Chapter 2]).
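For a small number of summands, the question raised above can be answered by brute force, which gives a feel for the size of the entropy difference. The following Python sketch is an illustration only (it is not the method of this paper): it computes the exact probability mass function of a sum of independent Bernoulli random variables by repeated convolution and compares its entropy, in nats, with that of a Poisson distribution with the same mean. The truncation point of the Poisson support and the choice of the $p_i$'s are assumptions made for the example.

```python
import math

def bernoulli_sum_pmf(ps):
    """PMF of a sum of independent Bernoulli(p_i) variables, by repeated convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for j, q in enumerate(pmf):
            nxt[j] += q * (1 - p)      # the new summand equals 0
            nxt[j + 1] += q * p        # the new summand equals 1
        pmf = nxt
    return pmf

def poisson_pmf(lam, j_max):
    """Po(lam) probabilities for j = 0..j_max via P(j) = P(j-1) * lam / j."""
    pmf = [math.exp(-lam)]
    for j in range(1, j_max + 1):
        pmf.append(pmf[-1] * lam / j)
    return pmf

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

ps = [0.01] * 50                      # 50 summands with small success probabilities
lam = sum(ps)                         # mean of the sum (0.5 up to rounding)
h_x = entropy(bernoulli_sum_pmf(ps))
h_po = entropy(poisson_pmf(lam, 100)) # truncation at 100 is an assumption; the tail is negligible here
print("H(X)      =", h_x)
print("H(Po(lam)) =", h_po)
print("difference =", h_po - h_x)
```

The convolution costs on the order of $n^2$ operations, so this direct approach does not scale to a very large number of summands, which is where rigorous bounds on the entropy difference become useful.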

Example 2: In the following, we wish to tighten the bounds on the entropy of a sum of independent Bernoulli random variables that are not necessarily identically distributed. The bound provided in [42, Proposition 1] relies on an upper bound on the total variation distance between this sum and a Poisson random variable with the same mean (see [4, Theorem 1] or [5, Theorem 2.M]). In order to tighten the bound on the entropy in the considered setting, we further rely on a lower bound on the total variation distance (see [42, Theorem 6 and Corollary 2]) and an upper bound on the local distance (see [5, Theorem 2.Q and Corollary 9.A.2]). The latter two bounds provide an upper bound on the ratio of the local and total variation distances, which enables the application of the bound in Theorem 5; it improves the bound in Corollary 3, which relies solely on an upper bound on the total variation distance. It is noted that the latter looser bound, which relies on Corollary 3 (or [42, Theorem 4]), was used in [42, Section 2] for estimating the entropy of a sum of Bernoulli random variables in the more general setting where the summands are possibly dependent.

Let $X = \sum_{i=1}^n X_i$ be a sum of independent Bernoulli random variables with $X_i \sim \mathrm{Bernoulli}(p_i)$ for $i \in \{1, \ldots, n\}$. Let $\sum_{i=1}^n p_i = \lambda$, and let $Y \sim \mathrm{Po}(\lambda)$ be a Poisson random variable with mean $\lambda$. From [4, Theorem 1] (or [5, Theorem 2.M]), the following upper bound on the total variation distance holds:

$$d_{\mathrm{TV}}(X,Y) \le \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2. \tag{57}$$

Furthermore, from [42, Corollary 2], the following lower bound on the total variation distance holds:

$$d_{\mathrm{TV}}(X,Y) \ge k \sum_{i=1}^n p_i^2 \tag{58}$$

where

$$k \triangleq \frac{e}{2\lambda} \cdot \frac{1 - \frac{1}{\theta}\left(3 + \frac{7}{\lambda}\right)}{\theta + 2e^{-1/2}} \tag{59}$$

$$\theta \triangleq 3 + \frac{7}{\lambda} + \frac{1}{\lambda} \sqrt{(3\lambda + 7)\bigl[(3 + 2e^{-1/2})\lambda + 7\bigr]}. \tag{60}$$

An upper bound on the local distance between a sum of independent Bernoulli random variables and a Poisson distribution with the same mean $\lambda$ follows as a special case of [5, Corollary 9.A.2] by setting $l = 1$ (so that the distribution $Q_l$ in this corollary is specialized for $l = 1$ to the Poisson distribution $\mathrm{Po}(\lambda)$, according to [5, Eq. (1.12) on p. 177]). Since the upper bound on the right-hand side of the inequality in [5, Corollary 9.A.2] does not depend on the (time) index $j$, it follows that the same bound also holds while referring to $d_{\mathrm{loc}}(X,Y) \triangleq \sup_{j \in \mathbb{N}_0} |P(X = j) - \mathrm{Po}(\lambda)\{j\}|$. Based on the notation used in this corollary, it implies that if $\sum_{i=1}^n p_i = \lambda$ then the local distance between a sum of independent Bernoulli random variables $X_i \sim \mathrm{Bernoulli}(p_i)$ and a Poisson random variable with mean $\lambda = \sum_{i=1}^n p_i$ is upper bounded by

$$d_{\mathrm{loc}}(X,Y) \le 4 \Bigl(2 \max_{j} P(Y = j)\Bigr) \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2 \overset{(a)}{\le} 4 \min\left\{2e^{-\lambda} I_0(\lambda), \sqrt{\frac{2}{e\lambda}}\right\} \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2 \tag{61}$$

where inequality (a) holds due to [5, Proposition A.2.7 on pp. 262-263], and $I_0$ denotes the modified Bessel function of order zero. Since an upper bound on the total variation distance also forms an upper bound on the local distance, then a combination of (57) and (61) gives that

$$d_{\mathrm{loc}}(X,Y) \le \min\left\{1, \; 4\sqrt{\frac{2}{e\lambda}}, \; 8e^{-\lambda} I_0(\lambda)\right\} \left(\frac{1 - e^{-\lambda}}{\lambda}\right) \sum_{i=1}^n p_i^2. \tag{62}$$

We now apply Theorem 5 to get rigorous bounds on the entropy $H(X)$ by estimating how close it is to $H(\mathrm{Po}(\lambda))$. Note that the improvement in the tightness of the bound in Theorem 5 in comparison to the looser bound in Corollary 3 is more remarkable when the ratio $\alpha$ of the local and total variation distances is close to zero. This happens to be the case if $\lambda \gg 1$ where, due to the asymptotic expansion of $I_0$ (see [1, Eq. (9.7.1) on p. 377] or [El, Eq. (8.451.5) on p. 973])

$$I_0(\lambda) \approx \frac{e^{\lambda}}{\sqrt{2\pi\lambda}} \quad \text{if } \lambda \gg 1$$

one gets from Eqs. (58)-(60) and (62), combined with the limit in [flU, Eq. (149)], that

$$\alpha \triangleq \frac{d_{\mathrm{loc}}(X,Y)}{d_{\mathrm{TV}}(X,Y)} \le \frac{\min\left\{1, \; 4\sqrt{\frac{2}{e\lambda}}, \; 8e^{-\lambda} I_0(\lambda)\right\} (1 - e^{-\lambda})}{\lambda k} \approx \frac{33.634}{\sqrt{\lambda}} \quad (\text{if } \lambda \gg 1) \tag{63}$$

so, for large values of $\lambda$, the upper bound on the parameter $\alpha$ in (12) decays to zero like the square root of $\frac{1}{\lambda}$.
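The numerical constant in the large-$\lambda$ behavior of the bound on $\alpha$ can be checked with the following Python sketch. It rests on assumptions that are stated here explicitly: the coefficient of the lower bound on the total variation distance is taken as $k = \frac{e}{2\lambda} \cdot \frac{1 - (3 + 7/\lambda)/\theta}{\theta + 2e^{-1/2}}$ with $\theta = 3 + \frac{7}{\lambda} + \frac{1}{\lambda}\sqrt{(3\lambda+7)[(3+2e^{-1/2})\lambda+7]}$, the coefficient of the upper bound on the local distance is taken as $8e^{-\lambda} I_0(\lambda)(1 - e^{-\lambda})/\lambda$, and $e^{-\lambda} I_0(\lambda)$ is replaced by its leading asymptotic term $1/\sqrt{2\pi\lambda}$. The function names are illustrative.

```python
import math

def theta(lam):
    """Assumed form of the parameter theta as a function of the Poisson mean."""
    return 3 + 7 / lam + math.sqrt((3 * lam + 7) * ((3 + 2 * math.exp(-0.5)) * lam + 7)) / lam

def k_lower(lam):
    """Assumed coefficient k of the lower bound d_TV >= k * sum(p_i^2)."""
    t = theta(lam)
    return (math.e / (2 * lam)) * (1 - (3 + 7 / lam) / t) / (t + 2 * math.exp(-0.5))

def alpha_bound_large_lam(lam):
    """Ratio of the local-distance upper bound to the TV lower bound, for lam >> 1 only.

    Uses e^{-lam} I_0(lam) ~ 1/sqrt(2 pi lam) and 1 - e^{-lam} ~ 1.
    """
    local_coefficient = 8 / math.sqrt(2 * math.pi * lam) / lam
    return local_coefficient / k_lower(lam)

for lam in (1e2, 1e4, 1e8):
    print(lam, math.sqrt(lam) * alpha_bound_large_lam(lam))
```

As `lam` grows, the printed product of $\sqrt{\lambda}$ and the bound settles near 33.634, which matches the constant quoted in the text.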

As a possible application, consider a noiseless binary-adder multiple-access channel (MAC) with $n$ independent users where each user transmits binary symbols, and the channel output is the algebraic sum of the input symbols. The capacity region of this MAC is an $n$-dimensional polyhedron. One feature of this capacity region is the sum of the rates that is given by $R_{\mathrm{SUM}} = \sum_{i=1}^n R_i$, and it is upper bounded by the joint mutual information between the input symbols $X_1, \ldots, X_n$ and the corresponding channel output $Y = \sum_{i=1}^n X_i$, i.e.,

$$R_{\mathrm{SUM}} \le \max_{P_{\underline{X}} : \, P_{\underline{X}} = P_{X_1} \cdots P_{X_n}} I(X_1, \ldots, X_n; Y)$$

where, since the MAC is noiseless and the output symbol is the sum of the $n$ input symbols, $H(Y | X_1, \ldots, X_n) = 0$, and therefore $I(X_1, \ldots, X_n; Y) = H(Y)$.$^1$ Hence, in the considered setting, the maximal sum rate is the maximal entropy of the sum of $n$ independent binary random variables where $X_i \sim \mathrm{Bernoulli}(p_i)$ for $i \in \{1, \ldots, n\}$. Under the constraint that $\sum_{i=1}^n E[X_i] \le \lambda$, it follows from the maximal entropy result in [17], [23] and [43] that the entropy of $Y$ is maximized when the $n$ independent inputs are i.i.d. with mean $p = \frac{\lambda}{n}$, and consequently the channel output $Y$ is Binomially distributed with $Y \sim \mathrm{Binom}(n, \frac{\lambda}{n})$. For a very large number of users, the calculation of the entropy of the Binomial distribution is difficult, and it would be much easier to calculate the entropy $H(\mathrm{Po}(\lambda))$ for a Poisson distribution with mean $\lambda$ (see (2)).
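To make the last point concrete, the following Python sketch evaluates the entropy of a Binomial channel output by direct summation, with the log-probabilities obtained from `math.lgamma` to avoid overflow of the binomial coefficients. It is a brute-force illustration under the assumption of i.i.d. inputs with $0 < p < 1$; the function name is illustrative. The cost grows linearly with the number of users $n$, and every evaluation requires the full probability mass function.

```python
import math

def binomial_entropy(n, p):
    """Entropy (nats) of Binom(n, p) for 0 < p < 1, by summing -P(j) * log P(j)."""
    total = 0.0
    for j in range(n + 1):
        log_pj = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                  + j * math.log(p) + (n - j) * math.log1p(-p))
        total -= math.exp(log_pj) * log_pj
    return total

# Two users with fair-coin inputs: Y takes the values 0, 1, 2
# with probabilities 1/4, 1/2, 1/4, so H(Y) = 1.5 * log(2) nats.
print(binomial_entropy(2, 0.5), 1.5 * math.log(2))

# A larger instance: n users, each with input mean lam / n.
n, lam = 1000, 10.0
print(binomial_entropy(n, lam / n))
```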

In the following, we make use of Theorem 5 to get an upper bound on the entropy difference

$$H\bigl(\mathrm{Po}(\lambda)\bigr) - H\Bigl(\mathrm{Binom}\bigl(n, \tfrac{\lambda}{n}\bigr)\Bigr) \tag{64}$$

where, due to the maximal entropy result for the Poisson distribution (see, e.g., [17], [23] or [43]), this difference is positive. Let $X \sim \mathrm{Binom}(n, \frac{\lambda}{n})$ be a sum of $n$ i.i.d. Bernoulli random variables with probability of success $p = \frac{\lambda}{n}$, and let $Y \sim \mathrm{Po}(\lambda)$. From (57), the total variation distance in this case is upper bounded by

$$d_{\mathrm{TV}}(X,Y) \le \frac{\lambda (1 - e^{-\lambda})}{n} \triangleq \eta_1. \tag{65}$$

From (58) and (59), the following inequality holds:

$$d_{\mathrm{TV}}(X,Y) \ge \frac{e}{2} \cdot \frac{1 - \frac{1}{\theta}\left(3 + \frac{7}{\lambda}\right)}{\theta + 2e^{-1/2}} \cdot \frac{\lambda}{n} \triangleq \eta_2 \tag{66}$$

where $\theta$ is given in (60). Furthermore, for using Theorem 5 one needs an upper bound on the local distance between the Poisson and Binomial distributions. Eq. (62) gives that

$$d_{\mathrm{loc}}(X,Y) \le \min\left\{1, \; 4\sqrt{\frac{2}{e\lambda}}, \; 8e^{-\lambda} I_0(\lambda)\right\} \frac{\lambda (1 - e^{-\lambda})}{n} \triangleq \eta_3. \tag{67}$$

1 The reader is referred to [6] for the consideration of the sum-rate for two noiseless multiple-access channels with some similarity to the binary adder channel (see the footnote in p. 43).
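The three parameters $\eta_1$, $\eta_2$ and $\eta_3$ can be evaluated numerically as in the following Python sketch. The formulas coded below are assumptions spelled out in the comments: $\eta_1 = \lambda(1-e^{-\lambda})/n$, $\eta_2 = \frac{e}{2} \cdot \frac{1 - (3+7/\lambda)/\theta}{\theta + 2e^{-1/2}} \cdot \frac{\lambda}{n}$, and $\eta_3 = \min\{1, 4\sqrt{2/(e\lambda)}, 8e^{-\lambda}I_0(\lambda)\}\,\eta_1$. The helper `scaled_bessel_i0` is a hypothetical pure-Python stand-in for an exponentially scaled Bessel routine; it uses the power series of $I_0$ for moderate arguments and a three-term asymptotic expansion otherwise.

```python
import math

def scaled_bessel_i0(lam):
    """e^{-lam} * I_0(lam): power series for lam <= 30, asymptotic expansion for larger lam."""
    if lam <= 30:
        term, total, k = 1.0, 1.0, 0
        while term > 1e-17 * total:
            k += 1
            term *= (lam / 2) ** 2 / (k * k)   # next term (lam/2)^(2k) / (k!)^2
            total += term
        return math.exp(-lam) * total
    return (1 + 1 / (8 * lam) + 9 / (128 * lam ** 2)) / math.sqrt(2 * math.pi * lam)

def theta(lam):
    """Assumed form of the parameter theta."""
    return 3 + 7 / lam + math.sqrt((3 * lam + 7) * ((3 + 2 * math.exp(-0.5)) * lam + 7)) / lam

def etas(n, lam):
    """(eta_1, eta_2, eta_3) for X ~ Binom(n, lam/n) and Y ~ Po(lam), under the assumed formulas."""
    eta1 = lam * (1 - math.exp(-lam)) / n                     # upper bound on d_TV
    t = theta(lam)
    eta2 = (math.e / 2) * (1 - (3 + 7 / lam) / t) / (t + 2 * math.exp(-0.5)) * lam / n  # lower bound on d_TV
    eta3 = min(1.0, 4 * math.sqrt(2 / (math.e * lam)), 8 * scaled_bessel_i0(lam)) * eta1  # upper bound on d_loc
    return eta1, eta2, eta3

print(etas(10 ** 6, 1e5))
```

For $n = 10^6$ and $\lambda = 10^5$ the upper bound on the local distance comes out roughly one order of magnitude below the lower bound on the total variation distance, which is the regime where the ratio $\alpha$ is small.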
Following the notation in Theorem 5, it follows that $m = n + 1$. From (27), one needs to choose an integer $M$ such that



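$$M \ge \max\left\{n + 2, \; \frac{\eta_2}{\eta_3 (1 - \eta_1)}\right\} \tag{68}$$

and

$$\sum_{j=M}^{\infty} \Pi_{\lambda}(j) \le \eta_3 \tag{69}$$

where $\Pi_{\lambda}(j) \triangleq \frac{e^{-\lambda} \lambda^j}{j!}$ for $j \in \mathbb{N}_0$ designates the probability mass function of $\mathrm{Po}(\lambda)$. Based on Chernoff's inequality,

$$\sum_{j=M}^{\infty} \Pi_{\lambda}(j) = P(Y \ge M) \le \exp\left\{-\left[\lambda + M \ln\left(\frac{M}{\lambda e}\right)\right]\right\}. \tag{70}$$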
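The Chernoff bound on the Poisson tail can be compared with the exact tail probability for moderate parameters, as in the following Python sketch. The bound is coded in the form $\exp\{-[\lambda + M\ln(M/(\lambda e))]\}$, which is meaningful for $M > \lambda$; the truncation of the infinite sum after a fixed number of terms and the function names are assumptions made for the illustration.

```python
import math

def poisson_tail(lam, M, extra=200):
    """P(Y >= M) for Y ~ Po(lam), summed over j = M..M+extra (truncated sum)."""
    log_p = -lam + M * math.log(lam) - math.lgamma(M + 1)   # log of P(Y = M)
    p = math.exp(log_p)
    total = 0.0
    for j in range(M, M + extra + 1):
        total += p
        p *= lam / (j + 1)                                   # P(Y = j + 1) from P(Y = j)
    return total

def chernoff_tail_bound(lam, M):
    """Chernoff bound exp{-[lam + M * ln(M / (lam * e))]} on P(Y >= M), for M > lam."""
    return math.exp(-(lam + M * math.log(M / (lam * math.e))))

lam = 5.0
for M in (10, 20, 40):
    print(M, poisson_tail(lam, M), chernoff_tail_bound(lam, M))

# For M >= lam * e^2 the logarithm is at least 1, so the bound is at most exp(-(lam + M)).
print(chernoff_tail_bound(lam, 40) <= math.exp(-(lam + 40)))
```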



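Let $M \ge \lambda e^2$; then it follows from (69) and (70) that it is sufficient for $M$ to satisfy the condition $\exp(-(\lambda + M)) \le \eta_3$. Combining it with (68) leads to the following possible choice of $M$:

$$M \triangleq \max\left\{n + 2, \; \frac{\eta_2}{\eta_3 (1 - \eta_1)}, \; \lambda e^2, \; \ln\left(\frac{1}{\eta_3}\right) - \lambda\right\} \tag{71}$$

where $\eta_1$, $\eta_2$ and $\eta_3$ are introduced in (65), (66), and (67), respectively. Finally, for the use of Theorem 5 one needs to choose $\eta_4 > 0$ such that $\sum_{j=M}^{\infty} \{-\Pi_{\lambda}(j) \log \Pi_{\lambda}(j)\} \le \eta_4$. From the analysis in [42, Eqs. (43)-(47)], it follows from the last inequality and [42, Eq. (47)] that $\eta_4$ here is equal to $\mu$ in [42, Eq. (23)], i.e.,

$$\eta_4 \triangleq \left[\left(\lambda \log\left(\frac{e}{\lambda}\right)\right)_+ + \lambda^2 + \frac{6 \log(2\pi) + 1}{12}\right] \exp\left\{-\left[\lambda + (M - 2) \log\left(\frac{M - 2}{\lambda e}\right)\right]\right\} \tag{72}$$

where $M$ is introduced in (71), and $(x)_+ \triangleq \max\{x, 0\}$ for every $x \in \mathbb{R}$. At this stage, we are ready to apply Theorem 5 to derive a bound on the non-negative difference between the entropies in (64). From Theorem 5, it follows that

$$0 \le H\bigl(\mathrm{Po}(\lambda)\bigr) - H\Bigl(\mathrm{Binom}\bigl(n, \tfrac{\lambda}{n}\bigr)\Bigr) \le \eta_1 \log\left(\frac{\eta_3 M}{\eta_2} - 1\right) + h(\eta_1) + \eta_4. \tag{73}$$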
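The choice of $M$ and the resulting tail parameter $\eta_4$ are straightforward to automate. The Python sketch below is a hedged illustration: it assumes that $M$ is the maximum of the four candidates $n+2$, $\eta_2/(\eta_3(1-\eta_1))$, $\lambda e^2$ and $\ln(1/\eta_3) - \lambda$, rounded up to an integer (the rounding is an assumption added here), and that $\eta_4 = [(\lambda\log(e/\lambda))_+ + \lambda^2 + (6\log(2\pi)+1)/12]\exp\{-[\lambda + (M-2)\log((M-2)/(\lambda e))]\}$. The function names and the sample inputs are illustrative.

```python
import math

def choose_M(n, lam, eta1, eta2, eta3):
    """Smallest integer at least as large as each of the four assumed candidates for M."""
    candidates = (
        n + 2,
        eta2 / (eta3 * (1 - eta1)),
        lam * math.e ** 2,            # makes ln(M / (lam * e)) >= 1 in the Chernoff bound
        math.log(1 / eta3) - lam,     # makes exp(-(lam + M)) <= eta3
    )
    return math.ceil(max(candidates))

def eta4(lam, M):
    """Assumed form of the bound on the entropy tail sum of Po(lam) from index M onward."""
    prefactor = (max(lam * math.log(math.e / lam), 0.0)
                 + lam ** 2
                 + (6 * math.log(2 * math.pi) + 1) / 12)
    exponent = lam + (M - 2) * math.log((M - 2) / (lam * math.e))
    return prefactor * math.exp(-exponent)   # underflows to 0.0 for very large exponents

# Illustrative inputs: n users, Poisson mean lam, and sample values for the three distance bounds.
n, lam = 10 ** 6, 1e5
M = choose_M(n, lam, eta1=0.1, eta2=9.5e-3, eta3=1.0e-3)
print(M, eta4(lam, M))
```

In this illustration the first candidate dominates, so $M = n + 2$, and the tail term is numerically zero.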
For comparison, it follows from Corollary 3 that the upper bound on the right-hand side of (73) is replaced by

$$\eta_1 \log(\widetilde{M} - 1) + h(\eta_1) + \eta_4 \tag{74}$$

where

$$\widetilde{M} \triangleq \max\left\{n + 2, \; \frac{1}{1 - \eta_1}\right\}. \tag{75}$$

Note that the bound in (73) improves the bound in (74) if $\eta_3 < \eta_2$ (i.e., if the upper bound on the local distance is smaller than the lower bound on the total variation distance). Furthermore, the latter bound does not take into account the parameters $\eta_2$ and $\eta_3$. As a numerical example, for $n = 10^6$ and $p = 0.1$, let us check the bound on the entropy difference in (64) for $\lambda = np$ (i.e., $\lambda = 10^5$). Eqs. (65)-(67), (71), (72) and (75) yield that

$$\eta_1 = 10^{-1}, \quad \eta_2 = 9.5 \cdot 10^{-3}, \quad \eta_3 = 1.0 \cdot 10^{-3}, \quad \eta_4 \approx 0, \quad M = \widetilde{M} = 10^6 + 2$$

and the two bounds in (73) and (74) are equal to 1.483 and 1.707 nats, respectively. The value of $H(\mathrm{Po}(\lambda))$ is 7.175 nats, so the entropy $H(\mathrm{Binom}(n, \frac{\lambda}{n}))$ lies between 5.693 and 7.175 nats. Note that for $n = 10^6$ and $\lambda = 10^4$, where $p = \frac{\lambda}{n}$ is decreased from $10^{-1}$ to $10^{-2}$, the upper bounds on (64) are decreased, respectively, to 0.183 and 0.194 nats, and $H(\mathrm{Po}(\lambda)) = 6.024$ nats. The Poisson approximation is more accurate in the latter case, consistently with the law of small numbers (see, e.g., [24]).
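The figures of this numerical example can be approximately reproduced from the rounded parameter values quoted above. The Python sketch below is a check under stated assumptions: the two upper bounds are taken as $\eta_1\log(\eta_3 M/\eta_2 - 1) + h(\eta_1) + \eta_4$ and $\eta_1\log(\widetilde{M}-1) + h(\eta_1) + \eta_4$, and the Poisson entropy is approximated by the leading terms $\frac{1}{2}\log(2\pi e\lambda) - \frac{1}{12\lambda}$ of its large-$\lambda$ expansion rather than computed exactly. Since the inputs are rounded to two significant digits, the first bound agrees with the quoted value only to about two decimal places.

```python
import math

def h(x):
    """Binary entropy function in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def local_tv_gap_bound(eta1, eta2, eta3, eta4, M):
    """Assumed bound that uses both the local and the total variation distances."""
    return eta1 * math.log(eta3 * M / eta2 - 1) + h(eta1) + eta4

def tv_only_gap_bound(eta1, eta4, M_tilde):
    """Assumed bound that uses only the upper bound on the total variation distance."""
    return eta1 * math.log(M_tilde - 1) + h(eta1) + eta4

def poisson_entropy_large_lam(lam):
    """Large-lam approximation of H(Po(lam)) in nats (leading terms only)."""
    return 0.5 * math.log(2 * math.pi * math.e * lam) - 1 / (12 * lam)

# Rounded parameter values for n = 10^6 and lam = 10^5.
eta1, eta2, eta3, eta4 = 1e-1, 9.5e-3, 1.0e-3, 0.0
M = M_tilde = 10 ** 6 + 2

upper_new = local_tv_gap_bound(eta1, eta2, eta3, eta4, M)
upper_old = tv_only_gap_bound(eta1, eta4, M_tilde)
h_po = poisson_entropy_large_lam(1e5)

print("bound with local distance   :", upper_new)
print("bound with TV distance only :", upper_old)
print("H(Po(lam)) approx           :", h_po)
print("range for H(Binom)          :", h_po - upper_new, "to", h_po)
```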

Remark 7: Example 2 considers the use of Theorem 5 for the estimation of the entropy of a sum of independent Bernoulli random variables. The more general case of the estimation of the entropy (via rigorous bounds) for a sum of possibly dependent Bernoulli random variables was considered in [42, Section II] by using the looser bound in Corollary 3 with an upper bound on the total variation distance that follows from the Chen-Stein method (see [3, Theorem 1]). It is noted that, in principle, the sharper bound in Theorem 5 can also be applied to obtain bounds on the entropy for a sum of possibly dependent Bernoulli random variables. To this end, in addition to the upper bound on the total variation distance in [3, Theorem 1], one needs to rely on a lower bound on the total variation distance (see [5, Chapter 3]) and an upper bound on the local distance (see [5, Theorem 2.Q on p. 42]). It is noted, however, that these distance bounds are much simplified in the setting of independent summands (see Example 2).

Remark 8: The Chen-Stein method for the Poisson approximation was adapted in [28] to the setting of the geometric distribution, and it yields a convenient method for assessing the accuracy of the geometric approximation to the distribution of the number of failures preceding the first success in dependent trials. A recent study of upper bounds on the total variation and local distances for the geometric approximation (respectively, denoted by $d_1$ and $d_2$ in [29]) enables the application of the entropy bounds in Theorem 5 and Corollary 3 in a conceptually similar way to Example 2. Furthermore, the entropy bound in Corollary 3 can be applied to compound geometric and negative binomial approximations, based on upper bounds on the total variation distance that were derived via Stein's method in [12] and [45], respectively.